In part 2, we will continue analyzing the potential relationships between variables that we have interest in. We will conduct 12 statistical analyses in total: 2 Chi-square tests to explore association between two categorical variables(Frisk VS Search, Officer in uniform VS Suspect arrested); 4 linear regressions to find out whether linear relation exists in two numeric variables(Observe time VS Stop time, Stop time VS Suspect reported age, Observe time VS Suspect height, Stop time VS Suspect weight); 6 Comparisons of Means T-test to see whether there is statistical difference between two specific groups in means of the variable studied(means of observe mins/stop mins of 2 different eye colors, means of observe mins/stop mins of thin suspects and heavy suspects, means of observe mins/stop mins of female suspects and male suspects).

We also develop our own methods to define and exclude outliers(since there are extreme large values in our studied variables), and use that to conduct the analysis again after using complete data to do first. Sometimes there is no statistical result found in the complete data, but we got some interesting results after excluding the extreme outliers.

# Set up: load the NYPD dataset and the packages that we will use
NYPD <- read.csv("Data/2018_sqf_database-abbr.csv")

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.9
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
head(NYPD)
##   STOP_FRISK_ID Stop.Frisk.Time YEAR2  MONTH2   DAY2      STOP_WAS_INITIATED
## 1             1        19:04:00  2018 January Monday   Based on C/W on Scene
## 2             2        23:00:00  2018 January Monday      Based on Radio Run
## 3             3        23:55:00  2018 January Monday      Based on Radio Run
## 4             4         3:23:00  2018 January Monday      Based on Radio Run
## 5             5         3:23:00  2018 January Monday      Based on Radio Run
## 6             6        21:20:00  2018 January Monday Based on Self Initiated
##   ISSUING_OFFICER_RANK ISSUING_OFFICER_COMMAND_CODE SUPERVISING_OFFICER_RANK
## 1                  POM                            1                      SGT
## 2                  POM                           34                      SGT
## 3                  POM                          808                      SGT
## 4                  POM                           63                      SGT
## 5                  POM                           63                      SGT
## 6                  POM                           71                      SGT
##   SUPERVISING_OFFICER_COMMAND_CODE OBSERVED_DURATION_MINUTES
## 1                                1                         0
## 2                               34                         1
## 3                              808                         0
## 4                               63                         2
## 5                               63                         2
## 6                               71                         1
##     SUSPECTED_CRIME_DESCRIPTION STOP_DURATION_MINUTES
## 1                      MENACING                    18
## 2                           CPW                    15
## 3                 GRAND LARCENY                    10
## 4                       ROBBERY                    15
## 5                       ROBBERY                    15
## 6 UNAUTHORIZED USE OF A VEHICLE                    15
##   OFFICER_EXPLAINED_STOP_FLAG OFFICER_NOT_EXPLAINED_STOP_DESCRIPTION
## 1                           1                                      0
## 2                           1                                      0
## 3                           1                                      0
## 4                           1                                      0
## 5                           1                                      0
## 6                           1                                      0
##   SUSPECT_ARRESTED_FLAG SUSPECT_ARREST_OFFENSE OFFICER_IN_UNIFORM_FLAG
## 1                     0                      0                       1
## 2                     0                      0                       1
## 3                     0                      0                       1
## 4                     1                ROBBERY                       1
## 5                     1                ROBBERY                       1
## 6                     0                      0                       1
##   FRISKED_FLAG SEARCHED_FLAG OTHER_CONTRABAND_FLAG FIREARM_FLAG
## 1            1             0                     0            0
## 2            1             1                     0            0
## 3            1             0                     0            0
## 4            1             1                     0            0
## 5            1             1                     0            0
## 6            1             0                     1            0
##   KNIFE_CUTTER_FLAG OTHER_WEAPON_FLAG WEAPON_FOUND_FLAG PHYSICAL_FORCE_CEW_FLAG
## 1                 0                 0                 0                       0
## 2                 0                 0                 0                       0
## 3                 0                 0                 0                       0
## 4                 0                 0                 0                       0
## 5                 0                 0                 0                       0
## 6                 0                 0                 0                       0
##   PHYSICAL_FORCE_DRAW_POINT_FIREARM_FLAG PHYSICAL_FORCE_HANDCUFF_SUSPECT_FLAG
## 1                                      0                                    0
## 2                                      1                                    0
## 3                                      0                                    0
## 4                                      0                                    0
## 5                                      0                                    0
## 6                                      0                                    0
##   PHYSICAL_FORCE_OC_SPRAY_USED_FLAG PHYSICAL_FORCE_OTHER_FLAG
## 1                                 0                         0
## 2                                 0                         0
## 3                                 0                         0
## 4                                 0                         0
## 5                                 0                         0
## 6                                 0                         0
##   PHYSICAL_FORCE_RESTRAINT_USED_FLAG PHYSICAL_FORCE_VERBAL_INSTRUCTION_FLAG
## 1                                  0                                      1
## 2                                  0                                      0
## 3                                  0                                      1
## 4                                  1                                      1
## 5                                  1                                      1
## 6                                  0                                      1
##   BACKROUND_CIRCUMSTANCES_VIOLENT_CRIME_FLAG
## 1                                          0
## 2                                          0
## 3                                          1
## 4                                          1
## 5                                          1
## 6                                          0
##   BACKROUND_CIRCUMSTANCES_SUSPECT_KNOWN_TO_CARRY_WEAPON_FLAG
## 1                                                          0
## 2                                                          0
## 3                                                          0
## 4                                                          0
## 5                                                          0
## 6                                                          0
##   SUSPECTS_ACTIONS_CASING_FLAG
## 1                            0
## 2                            0
## 3                            0
## 4                            0
## 5                            0
## 6                            0
##   SUSPECTS_ACTIONS_CONCEALED_POSSESSION_WEAPON_FLAG
## 1                                                 1
## 2                                                 0
## 3                                                 0
## 4                                                 0
## 5                                                 0
## 6                                                 0
##   SUSPECTS_ACTIONS_DECRIPTION_FLAG SUSPECTS_ACTIONS_DRUG_TRANSACTIONS_FLAG
## 1                                1                                       0
## 2                                1                                       0
## 3                                1                                       0
## 4                                1                                       0
## 5                                1                                       0
## 6                                1                                       0
##   SUSPECTS_ACTIONS_IDENTIFY_CRIME_PATTERN_FLAG SUSPECTS_ACTIONS_OTHER_FLAG
## 1                                            0                           0
## 2                                            0                           0
## 3                                            0                           0
## 4                                            0                           1
## 5                                            0                           1
## 6                                            0                           0
##   SUSPECTS_ACTIONS_PROXIMITY_TO_SCENE_FLAG SEARCH_BASIS_ADMISSION_FLAG
## 1                                        1                           0
## 2                                        0                           0
## 3                                        1                           0
## 4                                        0                           0
## 5                                        0                           0
## 6                                        1                           0
##   SEARCH_BASIS_CONSENT_FLAG SEARCH_BASIS_HARD_OBJECT_FLAG
## 1                         0                             0
## 2                         0                             0
## 3                         0                             0
## 4                         0                             0
## 5                         0                             0
## 6                         0                             0
##   SEARCH_BASIS_INCIDENTAL_TO_ARREST_FLAG SEARCH_BASIS_OTHER_FLAG
## 1                                      0                       0
## 2                                      0                       1
## 3                                      0                       0
## 4                                      0                       1
## 5                                      0                       1
## 6                                      0                       0
##   SEARCH_BASIS_OUTLINE_FLAG DEMEANOR_OF_PERSON_STOPPED SUSPECT_REPORTED_AGE
## 1                         0                COOPERATIVE                    0
## 2                         0                    NERVOUS                   26
## 3                         0                   RELIEVED                   40
## 4                         0                   COMPLAIN                   38
## 5                         0                  COMPLAINT                   36
## 6                         0                    NERVOUS                   20
##   SUSPECT_SEX       SUSPECT_RACE_DESCRIPTION SUSPECT_HEIGHT SUSPECT_WEIGHT
## 1        MALE                          WHITE            5.1            170
## 2        MALE                 BLACK HISPANIC            6.1            250
## 3        MALE                 WHITE HISPANIC            5.5            150
## 4        MALE AMERICAN INDIAN/ALASKAN NATIVE            5.1            160
## 5        MALE                          WHITE           5.11            230
## 6        MALE                          BLACK            5.1            150
##   SUSPECT_BODY_BUILD_TYPE SUSPECT_EYE_COLOR SUSPECT_HAIR_COLOR
## 1                     MED               BRO                BLK
## 2                     HEA               BRO                BLK
## 3                     THN               BRO                BLD
## 4                     MED               BRO                BLK
## 5                     MED               BRO                BLK
## 6                     THN               BRO                BLK
##   STOP_LOCATION_PRECINCT       STOP_LOCATION_FULL_ADDRESS
## 1                      1 VARICK STREET && FRANKLIN STREET
## 2                     34    DYCKMAN STREET && POST AVENUE
## 3                     43              2245 RANDALL AVENUE
## 4                     63       EAST 38 STREET && AVENUE L
## 5                     63       EAST 38 STREET && AVENUE L
## 6                     67             178 ROCKAWAY PARKWAY
##   STOP_LOCATION_BORO_NAME
## 1               MANHATTAN
## 2               MANHATTAN
## 3                   BRONX
## 4                BROOKLYN
## 5                BROOKLYN
## 6                BROOKLYN
### Linear Regression 1: Observe Time VS. Stop Time


## First roughly observe the linear relationship between Observe Mins and Stop Mins
# scatter plot of observe time(x) vs. stop time(y), done above -- seems not very linear
ggplot(NYPD, aes(OBSERVED_DURATION_MINUTES, STOP_DURATION_MINUTES)) + geom_point() + labs(title="The Relation between Observed Time and Stop Time", subtitle="(with all data)", y="stop duration mins", x="observed duration mins")

# calculate the correlation -- seems a moderate/weak positive correlation
corr_observet_stopt <- cor(NYPD$OBSERVED_DURATION_MINUTES,NYPD$STOP_DURATION_MINUTES)
print(paste("correlation between observe mins and stop mins is",corr_observet_stopt))
## [1] "correlation between observe mins and stop mins is 0.505381065226995"
# generate a line of best fit
ggplot(NYPD, aes(OBSERVED_DURATION_MINUTES, STOP_DURATION_MINUTES)) + geom_point() + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'

## Build the linear regression model and check whether the residuals fit the assumptions
observet_stopt_model <- lm(STOP_DURATION_MINUTES ~ OBSERVED_DURATION_MINUTES, data = NYPD)
print(observet_stopt_model)
## 
## Call:
## lm(formula = STOP_DURATION_MINUTES ~ OBSERVED_DURATION_MINUTES, 
##     data = NYPD)
## 
## Coefficients:
##               (Intercept)  OBSERVED_DURATION_MINUTES  
##                 11.372070                   0.009779
# generate histogram, normal probability plot of residuals and a residual plot -- the normal probability plot is not straight at the right end, and the residuals are not very randomly scattered, so the linear regression model may not be appropriate in this case. Or it could be mainly due to extremely large outliers.
residuals<-resid(observet_stopt_model)
hist(residuals)

qqnorm(residuals)

plot(fitted(observet_stopt_model), residuals)

# calculate and plot the standard residuals:
standard_residuals <- rstandard(observet_stopt_model)
plot(fitted(observet_stopt_model), standard_residuals)

## Data Extraction: exclude outliers whose standard residuals are greater than 2:
observet_stopt <- NYPD[c("OBSERVED_DURATION_MINUTES", "STOP_DURATION_MINUTES")]
observet_stopt_stdres <- cbind(observet_stopt, standard_residuals)
dim(observet_stopt_stdres)
## [1] 11008     3
observet_stopt_analysis <- observet_stopt_stdres[observet_stopt_stdres$standard_residuals < 2,]
dim(observet_stopt_analysis)
## [1] 10763     3
## Summary of Linear Regression Analysis: the p-value for the slope coefficient is much higher than 0.05, so there is no statistical evidence that there is a linear relationship between observed time and stop time
summary(lm(STOP_DURATION_MINUTES ~ OBSERVED_DURATION_MINUTES, data = observet_stopt_analysis))
## 
## Call:
## lm(formula = STOP_DURATION_MINUTES ~ OBSERVED_DURATION_MINUTES, 
##     data = observet_stopt_analysis)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.791 -4.791 -1.791  3.209 34.209 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                9.7908003  0.0762773  128.36   <2e-16 ***
## OBSERVED_DURATION_MINUTES -0.0002305  0.0002993   -0.77    0.441    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.908 on 10761 degrees of freedom
## Multiple R-squared:  5.513e-05,  Adjusted R-squared:  -3.779e-05 
## F-statistic: 0.5933 on 1 and 10761 DF,  p-value: 0.4412
### We would do the linear regression analysis again by excluding the outliers in Observe Time and Stop Time based on the methods we defined previously
### Outliers: Observe mins >35, Stop mins >50

exclude_observet_stopt <- NYPD[NYPD$OBSERVED_DURATION_MINUTES <=35 & NYPD$STOP_DURATION_MINUTES <=50,]
dim(exclude_observet_stopt)
## [1] 10795    59
dim(NYPD)
## [1] 11008    59
# Build the linear regression model and check whether the residuals fit the assumptions
exclude_observet_stopt_model <- lm(STOP_DURATION_MINUTES ~ OBSERVED_DURATION_MINUTES, data = exclude_observet_stopt)
print(exclude_observet_stopt_model)
## 
## Call:
## lm(formula = STOP_DURATION_MINUTES ~ OBSERVED_DURATION_MINUTES, 
##     data = exclude_observet_stopt)
## 
## Coefficients:
##               (Intercept)  OBSERVED_DURATION_MINUTES  
##                    9.6704                     0.2007
# generate histogram, normal probability plot of residuals and a residual plot 
residuals<-resid(exclude_observet_stopt_model)
hist(residuals)

qqnorm(residuals)

plot(fitted(exclude_observet_stopt_model), residuals)

# calculate and plot the standard residuals:
standard_residuals <- rstandard(exclude_observet_stopt_model)
plot(fitted(exclude_observet_stopt_model), standard_residuals)

## Data Extraction: exclude outliers whose standard residuals are greater than 2:
exclude_observet_stopt <- exclude_observet_stopt[c("OBSERVED_DURATION_MINUTES", "STOP_DURATION_MINUTES")]
exclude_observet_stopt_stdres <- cbind(exclude_observet_stopt, standard_residuals)
dim(exclude_observet_stopt_stdres)
## [1] 10795     3
exclude_observet_stopt_analysis <- exclude_observet_stopt_stdres[exclude_observet_stopt_stdres$standard_residuals < 2,]
dim(exclude_observet_stopt_analysis)
## [1] 10171     3
## Summary of Linear Regression Analysis: the p-value for the slope coefficient is very small(much smaller than 0.05), so there is statistical evidence that stop mins is positively linear ro observed mins, after we excluding the outliers. The slope coefficient is about 0.25.
summary(lm(STOP_DURATION_MINUTES ~ OBSERVED_DURATION_MINUTES, data = exclude_observet_stopt_analysis))
## 
## Call:
## lm(formula = STOP_DURATION_MINUTES ~ OBSERVED_DURATION_MINUTES, 
##     data = exclude_observet_stopt_analysis)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.630  -3.813  -1.843   1.944  18.439 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                8.05569    0.06622  121.65   <2e-16 ***
## OBSERVED_DURATION_MINUTES  0.25249    0.01420   17.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.937 on 10169 degrees of freedom
## Multiple R-squared:  0.03015,    Adjusted R-squared:  0.03005 
## F-statistic: 316.1 on 1 and 10169 DF,  p-value: < 2.2e-16
cor(exclude_observet_stopt_analysis$OBSERVED_DURATION_MINUTES,exclude_observet_stopt_analysis$STOP_DURATION_MINUTES)
## [1] 0.1736249
### Linear Regression 2: Stop Time VS. Suspect Reported Age


## First roughly observe the linear relationship between Stop Mins and Suspect Reported Age
# scatter plot of suspect reported age(x) vs. stop time(y), done above -- it is not linear by observation
ggplot(NYPD, aes(SUSPECT_REPORTED_AGE, STOP_DURATION_MINUTES)) + geom_point() + labs(title="Relation between Suspect Reported Age and Stop Time ", y="stop duration mins", x="suspect reported age")

# Here we use the same way as previously in part 1 to define and exclude outliers based on context; outliers: stop mins >50
sample_age_stopt <- NYPD[!NYPD$SUSPECT_REPORTED_AGE==0 & NYPD$STOP_DURATION_MINUTES<=50,]

# calculate the correlation -- seems a very weak positive correlation
corr_age_stopt <- cor(sample_age_stopt$SUSPECT_REPORTED_AGE, sample_age_stopt$STOP_DURATION_MINUTES)
print(paste("correlation between suspect reported age and stop mins is",corr_age_stopt))
## [1] "correlation between suspect reported age and stop mins is -0.00897348201582471"
# generate a line of best fit -- the line is approximately horizontal, meaning stop mins may not change a lot as age varies
ggplot(sample_age_stopt, aes(SUSPECT_REPORTED_AGE, STOP_DURATION_MINUTES)) + geom_point() + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'

## Build the linear regression model and check whether the residuals fit the assumptions
age_stopt_model <- lm(STOP_DURATION_MINUTES ~ SUSPECT_REPORTED_AGE, data = sample_age_stopt)
print(age_stopt_model)
## 
## Call:
## lm(formula = STOP_DURATION_MINUTES ~ SUSPECT_REPORTED_AGE, data = sample_age_stopt)
## 
## Coefficients:
##          (Intercept)  SUSPECT_REPORTED_AGE  
##            10.412864             -0.006322
# generate histogram, normal probability plot of residuals and a residual plot -- the histogram is not normally distributed and the normal probability plot is not approximately a straight line, so the linear regression model may not be appropriate in this case.
residuals<-resid(age_stopt_model)
hist(residuals)

qqnorm(residuals)

plot(fitted(age_stopt_model), residuals)

# Calculate and plot the standard residuals
standard_residuals <- rstandard(age_stopt_model)
plot(fitted(age_stopt_model), standard_residuals)

length(standard_residuals)
## [1] 10056
length(sample_age_stopt$STOP_DURATION_MINUTES)
## [1] 10056
## Data Extraction: exclude outliers whose standard residuals are greater than 2:
age_stopt <- sample_age_stopt[c("SUSPECT_REPORTED_AGE", "STOP_DURATION_MINUTES")]
age_stopt_stdres <- cbind(age_stopt, standard_residuals)
dim(age_stopt_stdres)
## [1] 10056     3
age_stopt_analysis <- age_stopt_stdres[age_stopt_stdres$standard_residuals < 2,]
dim(age_stopt_analysis)
## [1] 9399    3
## Summary of Linear Regression Analysis: the p-value for the slope coefficient is 0.03, which is less than 0.05, so we could reject the null hypothesis here in favor of alternative hypothesis. That is, there is statistical significance that suspect reported age is negatively linear to stop time.
summary(lm(STOP_DURATION_MINUTES ~ SUSPECT_REPORTED_AGE, data = age_stopt_analysis))
## 
## Call:
## lm(formula = STOP_DURATION_MINUTES ~ SUSPECT_REPORTED_AGE, data = age_stopt_analysis)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.706 -3.674 -1.621  1.706 18.727 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           8.853653   0.155688  56.868   <2e-16 ***
## SUSPECT_REPORTED_AGE -0.010552   0.004893  -2.156   0.0311 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.833 on 9397 degrees of freedom
## Multiple R-squared:  0.0004946,  Adjusted R-squared:  0.0003883 
## F-statistic:  4.65 on 1 and 9397 DF,  p-value: 0.03107
### Linear Regression 3: Observe Time VS. Suspect Height


## First roughly observe the linear relationship between Observe Mins and Suspect Height
# scatter plot of height(x) vs. observe time(y), done above -- it is not linear by observation
ggplot(NYPD, aes(SUSPECT_HEIGHT, OBSERVED_DURATION_MINUTES)) + geom_point() + labs(title="Relation between Suspect Height and Observed Time ", y="observed duration mins", x="suspect height")

# Here we use the same way as previously in part 1 to define and exclude outliers based on context; outliers: observed mins >35
NYPD$SUSPECT_HEIGHT <- as.numeric(NYPD$SUSPECT_HEIGHT)
## Warning: NAs introduced by coercion
sample_h_observet <- NYPD[!is.na(NYPD$SUSPECT_HEIGHT) & NYPD$OBSERVED_DURATION_MINUTES<=35,]

# calculate the correlation -- seems a very weak negative correlation
corr_h_observet <- cor(sample_h_observet$SUSPECT_HEIGHT, sample_h_observet$OBSERVED_DURATION_MINUTES)
print(paste("correlation between suspect height and observed mins is",corr_h_observet))
## [1] "correlation between suspect height and observed mins is -0.00223632383257302"
# generate a line of best fit -- the line is approximately horizontal, meaning observed mins may not change a lot as height varies
ggplot(sample_h_observet, aes(SUSPECT_HEIGHT, OBSERVED_DURATION_MINUTES)) + geom_point() + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'

## Build the linear regression model and check whether the residuals fit the assumptions
h_observet_model <- lm(OBSERVED_DURATION_MINUTES ~ SUSPECT_HEIGHT, data = sample_h_observet)
print(h_observet_model)
## 
## Call:
## lm(formula = OBSERVED_DURATION_MINUTES ~ SUSPECT_HEIGHT, data = sample_h_observet)
## 
## Coefficients:
##    (Intercept)  SUSPECT_HEIGHT  
##        2.23066        -0.02331
# generate histogram, normal probability plot of residuals and a residual plot -- the histogram is not normally distributed and the normal probability plot is not approximately a straight line, so the linear regression model may not be appropriate in this case.
residuals<-resid(h_observet_model)
hist(residuals)

qqnorm(residuals)

plot(fitted(h_observet_model), residuals)

# Calculate and plot the standard residuals
standard_residuals <- rstandard(h_observet_model)
plot(fitted(h_observet_model), standard_residuals)

## Data Extraction: exclude outliers whose standard residuals are greater than 2:
h_observet <- sample_h_observet[c("SUSPECT_HEIGHT", "OBSERVED_DURATION_MINUTES")]
h_observet_stdres <- cbind(h_observet, standard_residuals)
dim(h_observet_stdres)
## [1] 10548     3
h_observet_analysis <- h_observet_stdres[h_observet_stdres$standard_residuals < 2,]
dim(h_observet_analysis)
## [1] 10295     3
## Summary of Linear Regression Analysis: the p-value for the slope coefficient is much higher than 0.05, so there is no statistical evidence that there is a linear relationship between suspect heights and observed time
summary(lm(OBSERVED_DURATION_MINUTES ~ SUSPECT_HEIGHT, data = h_observet_analysis))
## 
## Call:
## lm(formula = OBSERVED_DURATION_MINUTES ~ SUSPECT_HEIGHT, data = h_observet_analysis)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5806 -0.5625 -0.5522  0.4271  8.4736 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.40768    0.25068   5.615 2.01e-08 ***
## SUSPECT_HEIGHT  0.02581    0.04434   0.582     0.56    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.733 on 10293 degrees of freedom
## Multiple R-squared:  3.293e-05,  Adjusted R-squared:  -6.423e-05 
## F-statistic: 0.3389 on 1 and 10293 DF,  p-value: 0.5605
### Linear Regression 4: Stop Time VS. Suspect Weight


## First roughly observe the linear relationship between Stop Mins and Suspect Weight
# scatter plot of suspect weight(x) vs. stop time(y), done above -- it is not linear by observation
ggplot(NYPD, aes(SUSPECT_WEIGHT, STOP_DURATION_MINUTES)) + geom_point() + labs(title="Relation between Suspect Weight and Stop Time ", y="stop duration mins", x="suspect weight")

# Here we use the same way as previously in part 1 to define and exclude outliers based on context; outliers: stop mins >50
NYPD$SUSPECT_WEIGHT <- as.numeric(NYPD$SUSPECT_WEIGHT)
## Warning: NAs introduced by coercion
sample_w_stopt <- NYPD[!is.na(NYPD$SUSPECT_WEIGHT) & NYPD$STOP_DURATION_MINUTES<=50,]

# calculate the correlation -- seems a very weak negative correlation
corr_w_stopt <- cor(sample_w_stopt$SUSPECT_WEIGHT, sample_w_stopt$STOP_DURATION_MINUTES)
print(paste("correlation between suspect weight and stop mins is",corr_w_stopt))
## [1] "correlation between suspect weight and stop mins is -0.0212103415853416"
# generate a line of best fit -- the line is downward light sloping
ggplot(sample_w_stopt, aes(SUSPECT_WEIGHT, STOP_DURATION_MINUTES)) + geom_point() + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'

## Build the linear regression model and check whether the residuals fit the assumptions
w_stopt_model <- lm(STOP_DURATION_MINUTES ~ SUSPECT_WEIGHT, data = sample_w_stopt)
print(w_stopt_model)
## 
## Call:
## lm(formula = STOP_DURATION_MINUTES ~ SUSPECT_WEIGHT, data = sample_w_stopt)
## 
## Coefficients:
##    (Intercept)  SUSPECT_WEIGHT  
##      10.946666       -0.005192
# generate histogram, normal probability plot of residuals and a residual plot -- the histogram is not normally distributed, the residuals are not very randomly scattered, and the normal probability plot is not approximately a straight line, so the linear regression model may not be appropriate in this case.
residuals<-resid(w_stopt_model)
hist(residuals)

qqnorm(residuals)

plot(fitted(w_stopt_model), residuals)

# Calculate and plot the standard residuals
standard_residuals <- rstandard(w_stopt_model)
plot(fitted(w_stopt_model), standard_residuals)

## Data Extraction: exclude outliers whose standard residuals are greater than 2:
w_stopt <- sample_w_stopt[c("SUSPECT_WEIGHT", "STOP_DURATION_MINUTES")]
w_stopt_stdres <- cbind(w_stopt, standard_residuals)
dim(w_stopt_stdres)
## [1] 10451     3
w_stopt_analysis <- w_stopt_stdres[w_stopt_stdres$standard_residuals < 2,]
dim(w_stopt_analysis)
## [1] 9806    3
## Summary of Linear Regression Analysis: the p-value for the slope coefficient is 0.04, which is less than 0.05, so we could reject the null hypothesis here in favor of alternative hypothesis. That is, there is statistical significance that suspect weight is negatively linear to stop time.
summary(lm(STOP_DURATION_MINUTES ~ SUSPECT_WEIGHT, data = w_stopt_analysis))
## 
## Call:
## lm(formula = STOP_DURATION_MINUTES ~ SUSPECT_WEIGHT, data = w_stopt_analysis)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.671 -3.568 -1.568  1.706 18.603 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     9.047370   0.289084  31.297   <2e-16 ***
## SUSPECT_WEIGHT -0.003423   0.001674  -2.045   0.0409 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.793 on 9804 degrees of freedom
## Multiple R-squared:  0.0004263,  Adjusted R-squared:  0.0003243 
## F-statistic: 4.181 on 1 and 9804 DF,  p-value: 0.0409
## Chi-square Test 1: Frisked Flag VS. Searched Flag


# Null hypothesis: There is no association between category A and category B
# Alternative hypothesis: There is an association between category A and category B

#install.packages("dplyr")
#library(dplyr)

#Select the variables we want to analyze: FRISKED_FLAG, SEARCHED_FLAG
analysis_vars <- NYPD %>%
       select(FRISKED_FLAG, SEARCHED_FLAG) %>%
       filter(FRISKED_FLAG %in% c("0", "1"), SEARCHED_FLAG %in% c("0", "1"))

head(analysis_vars)
##   FRISKED_FLAG SEARCHED_FLAG
## 1            1             0
## 2            1             1
## 3            1             0
## 4            1             1
## 5            1             1
## 6            1             0
#Let's investigate to see if there is an association between FRISKED_FLAG and SEARCHED_FLAG
frisked_searched <- table(analysis_vars$FRISKED_FLAG, analysis_vars$SEARCHED_FLAG)
print(frisked_searched)
##    
##        0    1
##   0 3444 1045
##   1 3896 2623
#Conduct a Chi-Squared Test of Independence
chisq.test(analysis_vars$FRISKED_FLAG, analysis_vars$SEARCHED_FLAG)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  analysis_vars$FRISKED_FLAG and analysis_vars$SEARCHED_FLAG
## X-squared = 343.28, df = 1, p-value < 2.2e-16
## As we can see, the p-value here is greatly larger than 0.05. Thus, we cannot refuse the Null hypothesis, there is no association between frisked flag and searched flag.
## Chi-square Test 2: Officer in Uniform Flag VS. Suspect Arrested Flag

#Select the variables we want to analyze: OFFICER_IN_UNIFORM_FLAG, SUSPECT_ARRESTED_FLAG
analysis_vars2 <- NYPD %>%
       select(OFFICER_IN_UNIFORM_FLAG, SUSPECT_ARRESTED_FLAG) %>%
       filter(OFFICER_IN_UNIFORM_FLAG %in% c("0", "1"), SUSPECT_ARRESTED_FLAG %in% c("0", "1"))

head(analysis_vars2)
##   OFFICER_IN_UNIFORM_FLAG SUSPECT_ARRESTED_FLAG
## 1                       1                     0
## 2                       1                     0
## 3                       1                     0
## 4                       1                     1
## 5                       1                     1
## 6                       1                     0
#Let's investigate to see if there is an association between OFFICER_IN_UNIFORM_FLAG and SUSPECT_ARRESTED_FLAG
uniform_arrested <- table(analysis_vars2$OFFICER_IN_UNIFORM_FLAG, analysis_vars2$SUSPECT_ARRESTED_FLAG)
print(uniform_arrested)
##    
##        0    1
##   0 1909  959
##   1 5984 2156
#Conduct a Chi-Squared Test of Independence
chisq.test(analysis_vars2$OFFICER_IN_UNIFORM_FLAG, analysis_vars2$SUSPECT_ARRESTED_FLAG)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  analysis_vars2$OFFICER_IN_UNIFORM_FLAG and analysis_vars2$SUSPECT_ARRESTED_FLAG
## X-squared = 50.166, df = 1, p-value = 1.413e-12
## As we can see, the p-value here is also larger than 0.05. Thus, we cannot refuse the Null hypothesis, there is no association between the officer in uniform flag and the suspect arrested flag.
### Comparison of Means T-test 1: mean observed duration time of 2 eye colors (a common eye color -- Brown and an uncommon eye color -- Blue)

## how we define common and uncommon: 9024 observations of "brown eye", 157 observations of "blue eye"
unique(NYPD$SUSPECT_EYE_COLOR)
##  [1] "BRO" "BLK" "GRY" "HAZ" "ZZZ" "BLU" "0"   "GRN" "MUL" "OTH"
table(NYPD$SUSPECT_EYE_COLOR)
## 
##    0  BLK  BLU  BRO  GRN  GRY  HAZ  MUL  OTH  ZZZ 
##  210 1025  157 9024   98   21   91    5    2  375
## subset the vectors that we are going to do the t-test
# We use the same method as previously in part 1 to exclude outliers in observed time(>35). We have to exclude that before the test, otherwise the mean would be severely raised by the outliers
sample_eye_observet <- NYPD %>%
       select(OBSERVED_DURATION_MINUTES, SUSPECT_EYE_COLOR) %>%
       filter(SUSPECT_EYE_COLOR %in% c("BRO", "BLU"), OBSERVED_DURATION_MINUTES<=35)


## Calculate the mean observed time by eye color
mean_observet_by_eyecolor <- sample_eye_observet %>%
       group_by(SUSPECT_EYE_COLOR) %>%
       summarize(mean_observet = mean(OBSERVED_DURATION_MINUTES, na.rm=TRUE))
print(mean_observet_by_eyecolor)
## # A tibble: 2 × 2
##   SUSPECT_EYE_COLOR mean_observet
##   <chr>                     <dbl>
## 1 BLU                        2.29
## 2 BRO                        2.10
## Create a comparative box plot of Observed Duration Minutes by Eye Color
ggplot(sample_eye_observet, aes(OBSERVED_DURATION_MINUTES, SUSPECT_EYE_COLOR)) + geom_boxplot(colour = "blue", fill="grey") + ggtitle("Observed Time by Eye Color")

## Conduct a comparison of means t-test:
t.test(OBSERVED_DURATION_MINUTES~SUSPECT_EYE_COLOR, data=sample_eye_observet, equal.var=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  OBSERVED_DURATION_MINUTES by SUSPECT_EYE_COLOR
## t = 0.54239, df = 160.81, p-value = 0.5883
## alternative hypothesis: true difference in means between group BLU and group BRO is not equal to 0
## 95 percent confidence interval:
##  -0.5004594  0.8794577
## sample estimates:
## mean in group BLU mean in group BRO 
##          2.292994          2.103494
# From the t-test result we can see that the p-value is much higher than 0.05 which implies there is no statistical significant difference in observed time means of the 2 different eye colors "blue" and "brown"; the 95% confidence intervals also include 0(no difference). We cannot reject null hypothesis.
### Comparison of Means T-test 2: mean stop duration time of 2 eye colors(a common eye color -- Brown and an uncommon eye color -- Blue)

## how we define common and uncommon: 9024 observations of "brown eye", 157 observations of "blue eye"
unique(NYPD$SUSPECT_EYE_COLOR)
##  [1] "BRO" "BLK" "GRY" "HAZ" "ZZZ" "BLU" "0"   "GRN" "MUL" "OTH"
table(NYPD$SUSPECT_EYE_COLOR)
## 
##    0  BLK  BLU  BRO  GRN  GRY  HAZ  MUL  OTH  ZZZ 
##  210 1025  157 9024   98   21   91    5    2  375
## subset the vectors that we are going to do the t-test
# We use the same method as previously in part 1 to exclude outliers in stop time(>50). We have to exclude that before the test, otherwise the mean would be severely raised by the outliers
sample_eye_stopt <- NYPD %>%
       select(STOP_DURATION_MINUTES, SUSPECT_EYE_COLOR) %>%
       filter(SUSPECT_EYE_COLOR %in% c("BRO", "BLU"), STOP_DURATION_MINUTES<=50)


## Calculate the mean stop time by eye color
mean_stopt_by_eyecolor <- sample_eye_stopt %>%
       group_by(SUSPECT_EYE_COLOR) %>%
       summarize(mean_stopt = mean(STOP_DURATION_MINUTES, na.rm=TRUE))
print(mean_stopt_by_eyecolor)
## # A tibble: 2 × 2
##   SUSPECT_EYE_COLOR mean_stopt
##   <chr>                  <dbl>
## 1 BLU                     10.9
## 2 BRO                     10.1
## Create a comparative box plot of Stop Duration Minutes by Eye Color
ggplot(sample_eye_stopt, aes(STOP_DURATION_MINUTES, SUSPECT_EYE_COLOR)) + geom_boxplot(colour = "blue", fill="grey") + ggtitle("Stop Time by Eye Color")

## Conduct a comparison of means t-test:
t.test(STOP_DURATION_MINUTES~SUSPECT_EYE_COLOR, data=sample_eye_stopt, equal.var=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  STOP_DURATION_MINUTES by SUSPECT_EYE_COLOR
## t = 1.2761, df = 160.74, p-value = 0.2038
## alternative hypothesis: true difference in means between group BLU and group BRO is not equal to 0
## 95 percent confidence interval:
##  -0.4737946  2.2043389
## sample estimates:
## mean in group BLU mean in group BRO 
##          10.92308          10.05780
# From the t-test result we can see that p-value is higher than 0.05. This implies that there is no statistical significant difference in observed time means of the 2 different eye colors "blue" and "brown"; the 95% confidence intervals also include 0(no difference). We cannot reject null hypothesis.
### Comparison of Means T-test 3: mean observed duration time of thin suspects and heavy suspects (body build type)

unique(NYPD$SUSPECT_BODY_BUILD_TYPE)
## [1] "MED" "HEA" "THN" "XXX" "U"   "0"
table(NYPD$SUSPECT_BODY_BUILD_TYPE) # Thin suspects are far more than heavy suspects.
## 
##    0  HEA  MED  THN    U  XXX 
##  148 1056 4107 5318  331   48
# We will use all the data to conduct the test first, and then we do again without outliers(observed mins >35)

## subset the vectors that we are going to do the t-test
all_sample_body_observet <- NYPD %>%
       select(OBSERVED_DURATION_MINUTES, SUSPECT_BODY_BUILD_TYPE) %>%
       filter(SUSPECT_BODY_BUILD_TYPE %in% c("HEA", "THN"))


## Calculate the mean observed time by body build type
all_mean_observet_by_body <- all_sample_body_observet %>%
       group_by(SUSPECT_BODY_BUILD_TYPE) %>%
       summarize(all_mean_observet = mean(OBSERVED_DURATION_MINUTES, na.rm=TRUE))
print(all_mean_observet_by_body)
## # A tibble: 2 × 2
##   SUSPECT_BODY_BUILD_TYPE all_mean_observet
##   <chr>                               <dbl>
## 1 HEA                                  11.0
## 2 THN                                  33.0
## Create a comparative box plot of Observed Duration Minutes by Body Build Type
ggplot(all_sample_body_observet, aes(OBSERVED_DURATION_MINUTES, SUSPECT_BODY_BUILD_TYPE)) + geom_boxplot(colour = "orange", fill="grey") + ggtitle("Observed Time by Body Build Type")

## Conduct a comparison of means t-test:
t.test(OBSERVED_DURATION_MINUTES~SUSPECT_BODY_BUILD_TYPE, data=all_sample_body_observet, equal.var=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  OBSERVED_DURATION_MINUTES by SUSPECT_BODY_BUILD_TYPE
## t = -1.1141, df = 5673.9, p-value = 0.2653
## alternative hypothesis: true difference in means between group HEA and group THN is not equal to 0
## 95 percent confidence interval:
##  -60.83788  16.74712
## sample estimates:
## mean in group HEA mean in group THN 
##          10.99148          33.03686
# From the t-test result we can see that the p-value is much higher than 0.05 which implies there is no statistical significant difference in observed time means of thin suspects and heavy suspects(with all data including outliers). Even though we could see obvious difference in the computed means(Thin 33.04 mins, Heavy 10.99 mins), we cannot conclude the mean observed time is different in these 2 groups.





# Then do it again excluding outliers: we use the same method as previously in part 1 to exclude outliers in observed time(>35), which is defined according to context. 
sample_body_observet <- NYPD %>%
       select(OBSERVED_DURATION_MINUTES, SUSPECT_BODY_BUILD_TYPE) %>%
       filter(SUSPECT_BODY_BUILD_TYPE %in% c("HEA", "THN"), OBSERVED_DURATION_MINUTES<=35)


mean_observet_by_body <- sample_body_observet %>%
       group_by(SUSPECT_BODY_BUILD_TYPE) %>%
       summarize(mean_observet = mean(OBSERVED_DURATION_MINUTES, na.rm=TRUE))
print(mean_observet_by_body)
## # A tibble: 2 × 2
##   SUSPECT_BODY_BUILD_TYPE mean_observet
##   <chr>                           <dbl>
## 1 HEA                              1.96
## 2 THN                              1.96
ggplot(sample_body_observet, aes(OBSERVED_DURATION_MINUTES, SUSPECT_BODY_BUILD_TYPE)) + geom_boxplot(colour = "orange", fill="grey") + ggtitle("Observed Time by Body Build Type")

t.test(OBSERVED_DURATION_MINUTES~SUSPECT_BODY_BUILD_TYPE, data=sample_body_observet, equal.var=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  OBSERVED_DURATION_MINUTES by SUSPECT_BODY_BUILD_TYPE
## t = 0.023353, df = 1448, p-value = 0.9814
## alternative hypothesis: true difference in means between group HEA and group THN is not equal to 0
## 95 percent confidence interval:
##  -0.2395527  0.2453253
## sample estimates:
## mean in group HEA mean in group THN 
##          1.961796          1.958909
# The p-value is much higher this time(0.98 >>> 0.05), we are still rejecting the null hypothesis. From the 2 tests above, we could conclude there is no statistical significant difference in observed duration minutes of thin and heavy suspects.
### Comparison of Means T-test 4: mean stop duration time of thin suspects and heavy suspects (body build type)

unique(NYPD$SUSPECT_BODY_BUILD_TYPE)
## [1] "MED" "HEA" "THN" "XXX" "U"   "0"
table(NYPD$SUSPECT_BODY_BUILD_TYPE) # Thin suspects are far more than heavy suspects.
## 
##    0  HEA  MED  THN    U  XXX 
##  148 1056 4107 5318  331   48
# We will use all the data to conduct the test first, and then we do again without outliers(stop mins >50)

## subset the vectors that we are going to do the t-test
all_sample_body_stopt <- NYPD %>%
       select(STOP_DURATION_MINUTES, SUSPECT_BODY_BUILD_TYPE) %>%
       filter(SUSPECT_BODY_BUILD_TYPE %in% c("HEA", "THN"))


## Calculate the mean stop time by body build type
all_mean_stopt_by_body <- all_sample_body_stopt %>%
       group_by(SUSPECT_BODY_BUILD_TYPE) %>%
       summarize(all_mean_stopt = mean(STOP_DURATION_MINUTES, na.rm=TRUE))
print(all_mean_stopt_by_body)
## # A tibble: 2 × 2
##   SUSPECT_BODY_BUILD_TYPE all_mean_stopt
##   <chr>                            <dbl>
## 1 HEA                               10.9
## 2 THN                               12.2
## Create a comparative box plot of Stop Duration Minutes by Body Build Type
ggplot(all_sample_body_stopt, aes(STOP_DURATION_MINUTES, SUSPECT_BODY_BUILD_TYPE)) + geom_boxplot(colour = "orange", fill="grey") + ggtitle("Stop Time by Body Build Type")

## Conduct a comparison of means t-test:
t.test(STOP_DURATION_MINUTES~SUSPECT_BODY_BUILD_TYPE, data=all_sample_body_stopt, equal.var=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  STOP_DURATION_MINUTES by SUSPECT_BODY_BUILD_TYPE
## t = -2.4166, df = 2772.8, p-value = 0.01573
## alternative hypothesis: true difference in means between group HEA and group THN is not equal to 0
## 95 percent confidence interval:
##  -2.2316900 -0.2323622
## sample estimates:
## mean in group HEA mean in group THN 
##          10.92330          12.15532
# From the t-test result we see, p-value is smaller than 0.05 which implies there is statistical significant difference in stop time means of thin suspects and heavy suspects(with all data including outliers).





# Then do it again excluding outliers: we use the same method as previously in part 1 to exclude outliers in stop time(>50), which is defined according to context. 
sample_body_stopt <- NYPD %>%
       select(STOP_DURATION_MINUTES, SUSPECT_BODY_BUILD_TYPE) %>%
       filter(SUSPECT_BODY_BUILD_TYPE %in% c("HEA", "THN"), STOP_DURATION_MINUTES<=50)


mean_stopt_by_body <- sample_body_stopt %>%
       group_by(SUSPECT_BODY_BUILD_TYPE) %>%
       summarize(mean_stopt = mean(STOP_DURATION_MINUTES, na.rm=TRUE))
print(mean_stopt_by_body)
## # A tibble: 2 × 2
##   SUSPECT_BODY_BUILD_TYPE mean_stopt
##   <chr>                        <dbl>
## 1 HEA                           9.98
## 2 THN                          10.1
ggplot(sample_body_stopt, aes(STOP_DURATION_MINUTES, SUSPECT_BODY_BUILD_TYPE)) + geom_boxplot(colour = "orange", fill="grey") + ggtitle("Stop Time by Body Build Type")

t.test(STOP_DURATION_MINUTES~SUSPECT_BODY_BUILD_TYPE, data=sample_body_stopt, equal.var=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  STOP_DURATION_MINUTES by SUSPECT_BODY_BUILD_TYPE
## t = -0.27834, df = 1478, p-value = 0.7808
## alternative hypothesis: true difference in means between group HEA and group THN is not equal to 0
## 95 percent confidence interval:
##  -0.6441890  0.4840885
## sample estimates:
## mean in group HEA mean in group THN 
##          9.980788         10.060838
# The p-value is much higher this time. Thus, we are rejecting the null hypothesis.
### Comparison of Means T-test 5: mean observed duration time of male suspects and female suspects (sex)

unique(NYPD$SUSPECT_SEX)
## [1] "MALE"   "FEMALE" "0"
table(NYPD$SUSPECT_SEX) # Male suspects are far more than female suspects.
## 
##      0 FEMALE   MALE 
##     69   1014   9925
# We will use all the data to conduct the test first, and then we do again without outliers(observed mins >35)

## subset the vectors that we are going to do the t-test
all_sample_sex_observet <- NYPD %>%
       select(OBSERVED_DURATION_MINUTES, SUSPECT_SEX) %>%
       filter(SUSPECT_SEX %in% c("MALE", "FEMALE"))


## Calculate the mean observed time by body build type
all_mean_observet_by_sex <- all_sample_sex_observet %>%
       group_by(SUSPECT_SEX) %>%
       summarize(all_mean_observet = mean(OBSERVED_DURATION_MINUTES, na.rm=TRUE))
print(all_mean_observet_by_sex)
## # A tibble: 2 × 2
##   SUSPECT_SEX all_mean_observet
##   <chr>                   <dbl>
## 1 FEMALE                   6.79
## 2 MALE                    23.3
## Create a comparative box plot of Observed Duration Minutes by Body Build Type
ggplot(all_sample_sex_observet, aes(OBSERVED_DURATION_MINUTES, SUSPECT_SEX)) + geom_boxplot(colour = "orange", fill="grey") + ggtitle("Observed Time by Sex")

## Conduct a comparison of means t-test:
t.test(OBSERVED_DURATION_MINUTES~SUSPECT_SEX, data=all_sample_sex_observet, equal.var=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  OBSERVED_DURATION_MINUTES by SUSPECT_SEX
## t = -1.5454, df = 10611, p-value = 0.1223
## alternative hypothesis: true difference in means between group FEMALE and group MALE is not equal to 0
## 95 percent confidence interval:
##  -37.434521   4.428841
## sample estimates:
## mean in group FEMALE   mean in group MALE 
##             6.793886            23.296725
# From the t-test result we see that the p-value is still higher than 0.05, which implies there is no statistical significant difference in observed time means of male suspects and female suspects(with all data including outliers). Even though we could see obvious difference in the computed means(Female 6.79 mins, Male 23.30 mins), we cannot conclude the mean observed time is different in these 2 groups.





# Then do it again excluding outliers: we use the same method as previously in part 1 to exclude outliers in observed time(>35), which is defined according to context. 
sample_sex_observet <- NYPD %>%
       select(OBSERVED_DURATION_MINUTES, SUSPECT_SEX) %>%
       filter(SUSPECT_SEX %in% c("MALE", "FEMALE"), OBSERVED_DURATION_MINUTES<=35)


mean_observet_by_sex <- sample_sex_observet %>%
       group_by(SUSPECT_SEX) %>%
       summarize(mean_observet = mean(OBSERVED_DURATION_MINUTES, na.rm=TRUE))
print(mean_observet_by_sex)
## # A tibble: 2 × 2
##   SUSPECT_SEX mean_observet
##   <chr>               <dbl>
## 1 FEMALE               3.02
## 2 MALE                 2.01
ggplot(sample_sex_observet, aes(OBSERVED_DURATION_MINUTES, SUSPECT_SEX)) + geom_boxplot(colour = "orange", fill="grey") + ggtitle("Observed Time by Sex")

t.test(OBSERVED_DURATION_MINUTES~SUSPECT_SEX, data=sample_sex_observet, equal.var=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  OBSERVED_DURATION_MINUTES by SUSPECT_SEX
## t = 4.9617, df = 1080.2, p-value = 8.115e-07
## alternative hypothesis: true difference in means between group FEMALE and group MALE is not equal to 0
## 95 percent confidence interval:
##  0.6127442 1.4144164
## sample estimates:
## mean in group FEMALE   mean in group MALE 
##             3.020792             2.007212
# The p-value is much higher this time, we are still rejecting the null hypothesis. From the 2 tests above, we could conclude there is no statistical significant difference in observed duration minutes of thin and heavy suspects.
### Comparison of Means T-test 6: mean stop duration time of male suspects and female suspects (sex)

unique(NYPD$SUSPECT_SEX)
## [1] "MALE"   "FEMALE" "0"
table(NYPD$SUSPECT_SEX) # Male suspects are far more than female suspects.
## 
##      0 FEMALE   MALE 
##     69   1014   9925
# We will use all the data to conduct the test first, and then we do again without outliers(stop mins >50)

## subset the vectors that we are going to do the t-test
all_sample_sex_stopt <- NYPD %>%
       select(STOP_DURATION_MINUTES, SUSPECT_SEX) %>%
       filter(SUSPECT_SEX %in% c("MALE", "FEMALE"))


## Calculate the mean stop time by body build type
all_mean_stopt_by_sex <- all_sample_sex_stopt %>%
       group_by(SUSPECT_SEX) %>%
       summarize(all_mean_stopt = mean(STOP_DURATION_MINUTES, na.rm=TRUE))
print(all_mean_stopt_by_sex)
## # A tibble: 2 × 2
##   SUSPECT_SEX all_mean_stopt
##   <chr>                <dbl>
## 1 FEMALE                13.5
## 2 MALE                  11.4
## Create a comparative box plot of Stop Duration Minutes by Body Build Type
ggplot(all_sample_sex_stopt, aes(STOP_DURATION_MINUTES, SUSPECT_SEX)) + geom_boxplot(colour = "orange", fill="grey") + ggtitle("Stop Time by Sex")

## Conduct a comparison of means t-test:
t.test(STOP_DURATION_MINUTES~SUSPECT_SEX, data=all_sample_sex_stopt, equal.var=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  STOP_DURATION_MINUTES by SUSPECT_SEX
## t = 4.5599, df = 1516.6, p-value = 5.528e-06
## alternative hypothesis: true difference in means between group FEMALE and group MALE is not equal to 0
## 95 percent confidence interval:
##  1.193524 2.995497
## sample estimates:
## mean in group FEMALE   mean in group MALE 
##             13.47436             11.37985
# From the t-test result we see that the p-value is much higher than 0.05 which implies there is no statistical significant difference in stop time means of male suspects and female suspects(with all data including outliers).





# Then do it again excluding outliers: we use the same method as previously in part 1 to exclude outliers in stop time(>50), which is defined according to context. 
sample_sex_stopt <- NYPD %>%
       select(STOP_DURATION_MINUTES, SUSPECT_SEX) %>%
       filter(SUSPECT_SEX %in% c("MALE", "FEMALE"), STOP_DURATION_MINUTES<=50)


mean_stopt_by_sex <- sample_sex_stopt %>%
       group_by(SUSPECT_SEX) %>%
       summarize(mean_stopt = mean(STOP_DURATION_MINUTES, na.rm=TRUE))
print(mean_stopt_by_sex)
## # A tibble: 2 × 2
##   SUSPECT_SEX mean_stopt
##   <chr>            <dbl>
## 1 FEMALE           12.2 
## 2 MALE              9.86
ggplot(sample_sex_stopt, aes(STOP_DURATION_MINUTES, SUSPECT_SEX)) + geom_boxplot(colour = "orange", fill="grey") + ggtitle("Stop Time by Sex")

t.test(STOP_DURATION_MINUTES~SUSPECT_SEX, data=sample_sex_stopt, equal.var=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  STOP_DURATION_MINUTES by SUSPECT_SEX
## t = 7.4806, df = 1150.1, p-value = 1.464e-13
## alternative hypothesis: true difference in means between group FEMALE and group MALE is not equal to 0
## 95 percent confidence interval:
##  1.746983 2.989192
## sample estimates:
## mean in group FEMALE   mean in group MALE 
##            12.230847             9.862759
# We are still rejecting the null hypothesis because the p-value is higher than 0.05. From the 2 tests above, we could conclude there is no statistical significant difference in stop duration minutes of male and female suspects.